Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

test: Add unit tests to test multiple files in single dataset #412

Merged

Conversation

Abhishek-TAMU
Copy link
Collaborator

@Abhishek-TAMU Abhishek-TAMU commented Dec 10, 2024

Description of the change

1- Added unit test case for verifying handling of multiple data files of 1 DataSetConfig passed via data_config.
2- Added unit test case for verifying handling of multiple data files of 1 DataSetConfig with (different format) passed via data_config.
3- Added unit test case for verifying handling of multiple data files of 1 DataSetConfig with (different type) passed via data_config.
4- Added unit test case (by @willmj) for verifying handling of multiple data files of multiple DataSetConfig with (different format) passed via data_config.
5- Data files in the tests/artifacts/testdata directory have been organized by file format for better categorization.
6- Unit tests in test_sft_trainer.py for e2e testing:

  • Existing Full FT unit tests with Arrow and Parquet Datafile format
  • Add new Full FT unit test with Multiple Datasets with Multiple files

Related issue number

https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1487

How to verify the PR

Verify unit test additions.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

1 similar comment
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the test label Dec 10, 2024
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
@Abhishek-TAMU Abhishek-TAMU marked this pull request as ready for review December 11, 2024 18:21
willmj
willmj previously approved these changes Dec 11, 2024
Copy link
Collaborator

@willmj willmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I like splitting test data into different folders by type. Have a couple of comments below. I'm interested to hear Dushyant's thoughts before merging

Comment on lines +776 to +779
[
TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
TWITTER_COMPLAINTS_DATA_INPUT_OUTPUT_JSONL,
],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a benefit in using multiple types of data here such as JSONL for one dataset, Arrow for another, and Parquet for a third?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May not be necessary, Good to have it though. Added!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been added? I do not see a mix of the different type of data @willmj asked here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this was added. I can see it in commit changes here. Let me know of other changes required in this unit test.

),
],
)
def test_process_dataconfig_multiple_files(data_config_path, list_data_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a test with three files just in case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this test case already have multiple cases, added case with 3 files in same unit test for all 3 handlers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reduce the test cases number and possibly just have

  1. Mix of all three -> 1 test
  2. each dataset multiple files, either 2 or three, maybe in a random mix

I think 3-4 scenrios should be fine, rest are anyway similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see there is one with varied data formats below...so maybe just a reduction of number of tests here could also work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have optimized the unit tests based on below comments. If we still need to optimize more do let me know here. @dushyantbehl

Comment on lines +93 to +94
if dataset_text_field not in element:
raise KeyError(f"Dataset should contain {dataset_text_field} field.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch!

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
),
],
)
def test_process_dataconfig_multiple_datasets_datafiles(datafiles, datasetconfigname):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case looks good. Updated the description. Thanks @willmj

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this test be combined with this one?

def test_process_dataset_configs_with_sampling(datafiles, datasetconfigname):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Combined into test_process_dataconfig_multiple_datasets_datafiles_sampling.

.pylintrc Outdated
@@ -333,7 +333,7 @@ indent-string=' '
max-line-length=100

# Maximum number of lines in a module.
max-module-lines=1200
max-module-lines=1400
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we will keep hitting this. I also had to disable this specifically for test_sft_trainer.py

https://github.com/foundation-model-stack/fms-hf-tuning/pull/409/files#diff-406c3c4d02c1ff158009b04d26a6b2d357f05d11e8c35c3f2ddfc1b54649a022R18

Copy link
Collaborator Author

@Abhishek-TAMU Abhishek-TAMU Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Removed it after merging in the change from the mentioned PR.

@@ -19,37 +19,47 @@

### Constants used for data
DATA_DIR = os.path.join(os.path.dirname(__file__))
JSON_DATA_DIR = os.path.join(os.path.dirname(__file__), "json")
Copy link
Contributor

@dushyantbehl dushyantbehl Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this segregation.

@@ -491,6 +497,284 @@ def test_process_dataconfig_file(data_config_path, data_path):
assert formatted_dataset_field in set(train_set.column_names)


@pytest.mark.parametrize(
"data_config_path, list_data_path",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we change the variable name list_data_path to data_path_list

),
],
)
def test_process_dataconfig_multiple_files_varied_types(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be combined with the above test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is added! Thank you!

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
@Abhishek-TAMU
Copy link
Collaborator Author

Thank you @dushyantbehl for the review. Have addressed the changes. Feel free to have a look once again.

Copy link
Contributor

@dushyantbehl dushyantbehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@willmj willmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Abhishek-TAMU Abhishek-TAMU merged commit 4441948 into foundation-model-stack:main Dec 13, 2024
8 checks passed
@Abhishek-TAMU Abhishek-TAMU deleted the multiple_files branch December 13, 2024 16:41
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants